0.0 Introduction

In this 2-part bootcamps, we’re going to work through the steps of RNA-seq data analysis. We’ll start with visualizing the raw data and doing some explorations on it, then move on to some basic quality control steps to produce cleaned and normalized expression counts, finally ending with some downstream analyses.

The part 1 will cover the content related to exploration and quality control of raw data.


Start of Part 1

1.0 Loading Packages

This is the base level package we will be needing to view and manipulate our data. We shall load the rest of the packages at the time of using the relevant functions required.

library(tidyverse)

2.0 FASTQ to BAM

The raw sequence data that you get directly from the sequencing machine is in the format of a FASTQ file. These are unmapped sequencing reads containing the information for each read generated, and is present in the following formatting:

  1. A sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary by based on the BCL to FASTQ conversion software used.
  2. The sequence (the base calls; A, C, T, G and N).
  3. A separator, which is simply a plus (+) sign.
  4. The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.

These sequence-read files now need to be mapped to the genome. The process of mapping or alignment results in mapped SAM or BAM files.

SAM: human-readable
BAM: binary format of a SAM file (computer-readable)

In order to generate your expression counts file, we need a BAM file. Usually, when you send samples for sequencing, the center will send you back the raw FASTQ files as well as aligned BAM files.

The steps of FASTQ alignment / BAM generation generally include:

  1. Trimming of adaptors / indexes / UMIs from the sequencing reads
  2. Downloading the genome reference file (in FASTA format .fa) to which you want to align your FASTQ file (can be downloaded from ENCODE, UCSC)
  3. Generating index files from the reference genome (this step basically segments the genome into smaller separate ‘indexes’, which are then used to align short read sequences. Given that genomes are many billion bp long, this indexing allows mapping at a faster rate using the smaller sized index files)
  4. Aligning your trimmed reads to the reference genome using the index files.

There are several software available for performing the above steps in a Unix environment, each having it’s own features which might be better suited for different data types:

  1. Trimming: Trimmomatic, CutAdapt
  2. Alignment: TopHat, Burrow-Wheeler Alignment (BWA), Bowtie2, STAR

There are many tutorials available online that work through the steps of alignment: https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2019/Introduction/SS_DB/Materials/Practicals/Practical2_alignment_JK.html


3.0 BAM to Expression Count Files

Okay, once you have your BAM files, extracting the counts for your reads (i.e. how many times was a particular transcript sequenced) is pretty easy and straightforward, and you can do it in R itself.

library(Rsubread)
library(here)

##an exampled bam file from 1000 Genomes
aligned <- featureCounts("data/HG00097.mapped.ILLUMINA.bwa.GBR.exome.20130415.bam"), annot.inbuilt = "hg19", isPairedEnd = TRUE)  

##expression counts
expression_counts <- aligned$counts
An example output from running the `featureCounts`

An example output from running the featureCounts

In this bootcamp, we can directly load the formatted data of expression counts and phenotypes from the .txt files.

eDat <- read.delim(normalizePath("data/GSE157103_formatted_eDat.txt"), sep = "\t")
pDat <- read.delim(normalizePath("data/GSE157103_formatted_pDat.txt"), sep = "\t")

4.0 Exploratory Data Analysis

##several methods to view dataframes
library(kableExtra, quietly = TRUE)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
as_tibble(pDat)
## # A tibble: 126 × 16
##    ID    Sample    Age Sex   COVID ICU   APACH…¹ Charl…² Mecha…³ Venti…⁴ Hospi…⁵
##    <chr> <chr>   <int> <chr> <chr> <chr>   <int>   <int> <chr>     <int>   <int>
##  1 C1    COVID_…    39 male  yes   no         15       0 yes           0       0
##  2 C2    COVID_…    63 male  yes   no          0       2 no           28      39
##  3 C3    COVID_…    33 male  yes   no          0       2 no           28      18
##  4 C4    COVID_…    49 male  yes   no          0       1 no           28      39
##  5 C5    COVID_…    49 male  yes   no         19       1 yes          23      27
##  6 C6    COVID_…     0 male  yes   no          0       1 no           28      36
##  7 C7    COVID_…    38 fema… yes   no          0       7 no           28      42
##  8 C8    COVID_…    78 male  yes   yes        43       7 yes           0       0
##  9 C9    COVID_…    64 fema… yes   yes        31       2 yes           0       0
## 10 C10   COVID_…    62 male  yes   yes        34       1 yes           2       0
## # … with 116 more rows, 5 more variables: Ferritin_ng.ml <int>, CRP_mg.l <dbl>,
## #   Procalcitonin_ng.ml <dbl>, Lactate_mmol.l <dbl>, Fibrinogen_mg.dL <int>,
## #   and abbreviated variable names ¹​APACHEII_Score, ²​Charlson_Score,
## #   ³​Mechanical_Ventilation, ⁴​Ventilator_free_days,
## #   ⁵​Hospital_free_days_post_45_days
##equivalent to using the pipeline command
# pDat %>%
#   as_tibble()

# pDat %>%
#   kable() %>%
#   kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F, fixed_thead = T)
 
# str(pDat)
 
# as_tibble((eDat[1:10, 1:10]))

##converting the column with all the gene names to assigned row names --
##we require certain identifiers (such as gene/sample names) as column 
##and row names and not as a separate column or row for a few analyses downstream
eDat <- eDat %>%  
  column_to_rownames(var = "gene")

##equivalent to
# rownames(eDat) <- eDat$gene
# eDat <- eDat[ ,-c(1)]

pDat <- pDat %>%  
  column_to_rownames(var = "ID")

##Do the column names of our expression dataframe (our samples names) 
##match the order of the rows in the ID column of the phenotype dataframe
all(colnames(eDat) == rownames(pDat))
## [1] TRUE
##column names of pDat
colnames(pDat)
##  [1] "Sample"                          "Age"                            
##  [3] "Sex"                             "COVID"                          
##  [5] "ICU"                             "APACHEII_Score"                 
##  [7] "Charlson_Score"                  "Mechanical_Ventilation"         
##  [9] "Ventilator_free_days"            "Hospital_free_days_post_45_days"
## [11] "Ferritin_ng.ml"                  "CRP_mg.l"                       
## [13] "Procalcitonin_ng.ml"             "Lactate_mmol.l"                 
## [15] "Fibrinogen_mg.dL"
##what's the separation of patients by sex?
pDat %>% 
  dplyr::count(COVID, Sex)
##   COVID     Sex  n
## 1    no  female 13
## 2    no    male 12
## 3    no unknown  1
## 4   yes  female 38
## 5   yes    male 62
##what's the ICU status by COVID status?
pDat %>% 
  dplyr::count(COVID, ICU)
##   COVID ICU  n
## 1    no  no 10
## 2    no yes 16
## 3   yes  no 50
## 4   yes yes 50
##we can't do that for age, as it a continuous variable.

##making a new categorical variable from a continuous
pDat <- pDat %>% 
  mutate(Age_bracket = case_when(
    Age < 20 ~ "below_20",
    between(Age, 21, 40) ~ "21-40",
    between(Age, 41, 60) ~ "41-60",
    Age > 60 ~ "above_60"
  ))

##treat categorical variables as factors
pDat$Age_bracket <- as.factor(pDat$Age_bracket)

pDat %>% 
  count(COVID, Age_bracket)
##   COVID Age_bracket  n
## 1    no       21-40  3
## 2    no       41-60  6
## 3    no    above_60 17
## 4   yes       21-40 14
## 5   yes       41-60 29
## 6   yes    above_60 56
## 7   yes    below_20  1
pDat %>% 
  ggplot(aes(x = Age_bracket, fill = Age_bracket)) +
  geom_bar(stat = "count") +
  facet_grid(~COVID)

##reorder levels
pDat$Age_bracket <- fct_relevel(pDat$Age_bracket, c("below_20", "21-40", "41-60", "above_60"))  
##Let's now plot this distribution of our COVID patients by their age bracket
pDat %>% 
  ggplot(aes(x = Age_bracket, fill = Age_bracket)) +
  geom_bar(stat = "count") +
  theme_minimal() +
  facet_grid(~COVID) +
  labs(title = "Distribution of COVID Patients by Age Bracket")

##Now, let's check how protein levels vary by COVID status

##Making a new variable that only includes protein measurements

names(pDat)
##  [1] "Sample"                          "Age"                            
##  [3] "Sex"                             "COVID"                          
##  [5] "ICU"                             "APACHEII_Score"                 
##  [7] "Charlson_Score"                  "Mechanical_Ventilation"         
##  [9] "Ventilator_free_days"            "Hospital_free_days_post_45_days"
## [11] "Ferritin_ng.ml"                  "CRP_mg.l"                       
## [13] "Procalcitonin_ng.ml"             "Lactate_mmol.l"                 
## [15] "Fibrinogen_mg.dL"                "Age_bracket"
proteins <- pDat %>% 
  dplyr::select(COVID, Ferritin_ng.ml, CRP_mg.l, Procalcitonin_ng.ml, Lactate_mmol.l, Fibrinogen_mg.dL)

##We're now going to collapse this `wide` dataframe into a `long` one

proteins <- proteins %>% 
  pivot_longer(cols = 2:6, names_to = "protein", values_to = "measurement")

##plotting the spread of each of the proteins by COVID status:

proteins %>% 
  ggplot(aes(x = COVID, y = measurement, fill = COVID)) +
  geom_boxplot() +
  # geom_jitter(shape = 16, colour = "grey", alpha = 0.5, width = 0.2) +
  # scale_fill_manual(values = c("orange", "grey")) +
  facet_wrap(~protein, scales = "free") + #scales = free allows the y-axis of each plot to have variable limits
  theme_minimal()

##how about mechanical ventilation? what percentage of COVID patients needed mechanical ventilation?

pDat %>%
  # filter(COVID == "yes") %>% #1
  ggplot(aes(x = Mechanical_Ventilation, fill = Mechanical_Ventilation)) +
  geom_bar(stat = "count", width = 0.6) +
  scale_fill_manual(values = c("grey", "orange")) +
  theme_minimal() +
  facet_grid(~COVID) +
  labs(y = "Counts of patients", title = "COVID Patients")

Let’s now look at the distribution of our expression data

dim(eDat)
## [1] 19472   126
#19372 126

library(reshape2, quietly = TRUE) #similar functioning to pivot_longer
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
##melt is a function that condenses all of your data into only two columns - 
##one with the character value and one with it's corresponding numerical value
e_melt <- melt(eDat)
## No id variables; using all as measure variables
head(e_melt)
##   variable value
## 1       C1 29.01
## 2       C1  0.00
## 3       C1 17.00
## 4       C1  3.00
## 5       C1  1.00
## 6       C1  0.00
colnames(e_melt)[1:2] <- c("sample", "expression")

e_melt %>% 
  ggplot(aes(x = log2(expression), color = sample, fill = sample)) +
  geom_density(alpha = 0.1) + 
  theme_minimal() + 
  theme(legend.position = "none") + #has to come after specifying theme
  labs(x = "log2RPM", y = "Density", title = "Sample Distribution - Density Plot", subtitle = "Raw Counts\n")
## Warning: Removed 544485 rows containing non-finite values (stat_density).

##We get this error - Removed 544485 rows containing non-finite values (stat_density). 
##That's because we have some genes which have an expression value of 0, which when transformed 
##to log2 give infinity as the output as log2(0) does not exist. Hence, we will apply a log2+1 
##transformation which adds a unit of 1 to all log2 counts, hence converting our log2(0) expression 
##values to 0.

e_melt <- e_melt %>% 
  mutate(log_x_1 = log2(expression + 1))

##We'll store this plot in a variable
g1 <- e_melt %>% 
  ggplot(aes(log_x_1, color = sample, fill = sample)) +
  geom_density(alpha = 0.1) + 
  theme_minimal() + 
  theme(legend.position = "none") + #has to come after specifying theme
  labs(x = "log2(x+1) RPM", y = "Density", title = "Sample Distribution - Density Plot", subtitle = "Raw Counts\n")

g1

##Okay, now we have an idea about how our expression data looks. Let's now 
##take a look at our samples. How do they correlate with each other?
samp_cor <- cor(eDat)

dim(samp_cor)
## [1] 126 126
##to visualize our correlations, we'll make a heatmap of the sample correlation values

library(pheatmap, quietly = TRUE)

h1 <- samp_cor %>% 
  pheatmap(clustering_distance_rows = "euclidean", clustering_distance_cols = "euclidean", 
           clustering_method = "complete", cluster_rows = TRUE, cluster_cols = TRUE,
           show_colnames = FALSE, show_rownames = FALSE, 
           annotation_row = pDat[c("COVID", "Sex")], annotation_col = pDat[c("COVID", "Sex")], 
           main = "Sample Correlations"
  )

##let's add some custom colours
annot_cols <- list(COVID = c(`yes` = "grey", `no` = "orange"), 
                   Sex = c(`male` = "sea green", `female` = "purple", `unknown` = "yellow")) 

h1 <- samp_cor %>% 
  pheatmap(clustering_distance_rows = "euclidean", clustering_distance_cols = "euclidean", 
           clustering_method = "complete", cluster_rows = TRUE, cluster_cols = TRUE,
           show_colnames = FALSE, show_rownames = FALSE,  
           annotation_row = pDat[c("COVID", "Sex")], annotation_col = pDat[c("COVID", "Sex")], 
           annotation_colors = annot_cols,
           main = "Sample Correlations"
  )

5.0 Quality Control

We’ll explore a few different filtering criteria and decide on one, and then normalize our data.

#dim(eDat)

##removing sequences with RPM of 0 in all samples /
##keeping only sequences with RPM > 0 in at least 1 sample
e_fil1 <- eDat %>% 
  rownames_to_column(var = "gene") %>% 
  filter_at(vars(-gene), any_vars(. != 0)) %>% 
  column_to_rownames(var = "gene")

#dim(e_fil1)
#19472 - 18340 
# 1132 sequences/genes removed

melt_fil1 <- melt(e_fil1) 
## No id variables; using all as measure variables
melt_fil1 <- melt_fil1 %>% 
  mutate(log_x_1 = log2(value + 1))

g_fil1 <- melt_fil1 %>% 
  ggplot(aes(log_x_1, color = variable, fill = variable)) +
  geom_density(alpha = 0.1) + 
  theme_minimal() + 
  theme(legend.position = "none") + #has to come after specifying theme
  labs(x = "log2(x+1) RPM", y = "Density", title = "Sample Distribution - Density Plot", subtitle = "Removing Sequences with RPM of 0 in all samples", caption = "Raw counts")

g_fil1

##keeping only sequences with RPM >= 1 in all samples
e_fil2 <- eDat %>% 
  rownames_to_column(var = "gene") %>% 
  filter_at(vars(-gene), all_vars(. >= 1))  %>% 
  column_to_rownames(var = "gene")

#dim(e_fil2)
#19472 - 11860
#7612 sequences removed

melt_fil2 <- melt(e_fil2)
## No id variables; using all as measure variables
melt_fil2 <- melt_fil2 %>% 
  mutate(log_x_1 = log2(value + 1))

g_fil2 <- melt_fil2 %>% 
  ggplot(aes(log_x_1, color = variable, fill = variable)) +
  geom_density(alpha = 0.1) + 
  theme_minimal() + 
  theme(legend.position = "none") + #has to come after specifying theme
  labs(x = "log2(x+1) RPM", y = "Density", title = "Sample Distribution - Density Plot", subtitle = "Sequences with RPM >= 1 in in all samples", caption = "Raw counts")

g_fil2

##keeping only sequences with RPM >= 1 in 30% of samples --> depends on groups in data

# e_fil4 <- eDat %>%
#   filter(rowSums(across(where(is.numeric)) >= 1) > 38)

RPM_morethan1 <- eDat >= 1
##This gives you a matrix with boolean values. Now to get the sum of all the TRUEs in each sample

table(rowSums(RPM_morethan1))
## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
##  1172   369   223   150   122   123   101    85    90    82    71    68    55 
##    13    14    15    16    17    18    19    20    21    22    23    24    25 
##    47    48    46    45    49    58    30    49    45    49    47    32    44 
##    26    27    28    29    30    31    32    33    34    35    36    37    38 
##    32    39    35    35    38    30    41    34    36    40    25    33    24 
##    39    40    41    42    43    44    45    46    47    48    49    50    51 
##    27    23    27    20    25    37    30    36    19    34    30    32    16 
##    52    53    54    55    56    57    58    59    60    61    62    63    64 
##    22    22    30    33    26    25    28    26    38    21    21    18    19 
##    65    66    67    68    69    70    71    72    73    74    75    76    77 
##    23    17    34    27    30    27    31    29    25    23    22    20    22 
##    78    79    80    81    82    83    84    85    86    87    88    89    90 
##    15    30    27    31    29    24    26    30    22    37    25    26    31 
##    91    92    93    94    95    96    97    98    99   100   101   102   103 
##    27    32    32    25    31    35    25    35    35    35    22    43    41 
##   104   105   106   107   108   109   110   111   112   113   114   115   116 
##    31    34    37    42    43    47    52    50    42    58    55    62    72 
##   117   118   119   120   121   122   123   124   125   126 
##    82    75    71    95    97   134   175   263   467 11860
##This shows that there are 1172 sequences where all samples have a RPM of less than 1. 
##And there are 11860 sequences which have a RPM pf >= 1 in all 126 samples --> the same 
##number we got for e_fil2. This matching of the results/numbers is called a sanity-check, 
##which you should be doing often i.e., making sure that the results that you are getting 
##are indeed correct and are cross-checked by some other method.

e_fil3 <- as.data.frame(rowSums(RPM_morethan1) >= 38)
##keeping only sequences which have a total of 38 TRUEs (i.e. an RPM of >= 1) values or more 

e_fil3 <- e_fil3 %>% 
  filter(.[1] == "TRUE")

e_fil3 <- eDat %>% 
  filter(rownames(eDat) %in% rownames(e_fil3))

#dim(e_fil3)
#19472 - 15754
#3718 sequences removed

melt_fil3 <- melt(e_fil3) 
## No id variables; using all as measure variables
melt_fil3 <- melt_fil3 %>% 
  mutate(log_x_1 = log2(value + 1))

g_fil3 <- melt_fil3 %>% 
  ggplot(aes(log_x_1, color = variable, fill = variable)) +
  geom_density(alpha = 0.1) + 
  theme_minimal() + 
  theme(legend.position = "none") + #has to come after specifying theme
  labs(x = "log2RPM", y = "Density", title = "Sample Distribution - Density Plot", subtitle = "Sequences with RPM >= 1 in 30% of samples", caption = "Raw counts")

g_fil3

##plot all density plots together
library(ggpubr)
ggarrange(g_fil1, g_fil2, g_fil3)

##let's now put the total number of sequences retained after 
##each of the filtering steps into a separate dataframe  
seq_counts <- data.frame(nrow(eDat))
seq_counts$fil_1 <- nrow(e_fil1)
seq_counts$fil_2 <- nrow(e_fil2)
seq_counts$fil_3 <- nrow(e_fil3)
colnames(seq_counts) <- c("All sequences", "Sequences with RPM > 0 in at least 1 sample", "Sequences with RPM >= 1 in all samples", "Sequences with RPM >= 1 in 30% of samples")
seq_counts <- as.data.frame(t(seq_counts))
seq_counts <- seq_counts %>%
  rownames_to_column()
colnames(seq_counts)[1:2] <- c("filtering_criteria", "sequences")
seq_counts %>%
  ggplot(aes(x = filtering_criteria, y = sequences, fill = filtering_criteria)) +
  geom_bar(stat = "identity") +
  theme_minimal()

seq_counts$filtering_criteria <- c("All sequences", "Sequences with\nRPM > 0 in atleast\n1 sample", "Sequences with\nRPM >= 1 in\nall samples", "Sequences with\nRPM >= 1 in\n30% of samples")

seq_counts %>%
  ggplot(aes(x = filtering_criteria, y = sequences, fill = filtering_criteria)) +
  geom_bar(stat = "identity", width = 0.6) +
  # scale_fill_viridis_d() +
  # geom_text(aes(label = sequences), vjust = -0.8, colour = "#333333", position = position_dodge(0.65)) +
  # scale_y_continuous(expand = expansion(mult = c(0, .09))) +
  theme_minimal() +
  # theme(legend.position = "none") +
  labs(x = "Filtering Criteria", y = "No. of Sequences", title = "No. of sequences by filtering criteria")

The filtering step you choose depends upon the question you’re asking of your data:

  • Do you want to check only highly-expressed, high-confidence genes? You would then use a more stringent filtering criteria
  • Do you want to profile all possibly expressed genes? The filtering criteria would be more inclusive

We’re now going to apply the Relative-Log Expression method to normalize our data - an essential step to make sure that we curb any outliers in our data as to not overestimate highly-expressed genes and have the expression distributed normally.
RLE accounts for between-sample variation, after which we’ll scale by RPM to account for within-sample variation

library(edgeR, quietly = TRUE)

genes <- as.data.frame(row.names(e_fil2))
norm <- DGEList(counts = e_fil2, samples = pDat, genes = genes, group = pDat$COVID)
eNorm <- calcNormFactors(norm, method = "RLE") 
eNorm <- cpm(eNorm)
eNorm <- as.data.frame(eNorm)

melt_norm <- melt(eNorm) 
## No id variables; using all as measure variables
melt_norm %>% 
  ggplot(aes(log2(value), color = variable, fill = variable)) +
  geom_density(alpha = 0.1) + 
  theme_minimal() + 
  theme(legend.position = "none") + #has to come after specifying theme
  labs(x = "log2 RPM", y = "Density", title = "Sample Distribution - Density Plot: RLE Normalized Counts", subtitle = "Sequences with RPM >= 1 in all samples")

##however, now that we normalized our data, that means that some of the 
##expression counts might have been changed to 0, as we can very well see. 
##We'll perform the same x+1 transformation again

melt_norm <- melt_norm %>% 
  mutate(log_x_1 = log2(value + 1))

g2 <- melt_norm %>% 
  ggplot(aes(log_x_1, color = variable, fill = variable)) +
  geom_density(alpha = 0.1) + 
  theme_minimal() + 
  theme(legend.position = "none") + #has to come after specifying theme
  labs(x = "log2 (x+1) RPM", y = "Density", title = "Sample Distribution - Density Plot: RLE Normalized Counts", subtitle = "Sequences with RPM >= 1 in all samples")

g2

##Raw versus normalized counts
ggarrange(g1, g2)

Let’s now plot the raw and normalized expression counts of a random gene

eDat <- eDat %>% 
  rownames_to_column(var = "gene")

random_sample <- eDat %>%
  # rownames_to_column(var = "gene")
  filter(gene == "TRIM62") %>% 
  column_to_rownames(var = "gene")

sample_from_eNorm <- eNorm %>% 
  rownames_to_column(var = "gene") %>% 
  filter(gene == "TRIM62") %>% 
  column_to_rownames(var = "gene")

row.names(random_sample)[1] <- "TRIM62_raw"    
row.names(sample_from_eNorm)[1] <- "TRIM62_norm" 

TRIM62 <- rbind(random_sample, sample_from_eNorm)

TRIM62 <- TRIM62 %>% 
  rownames_to_column(var = "TRIM62_value")

TRIM62 <- TRIM62 %>% 
  pivot_longer(cols = -c(TRIM62_value), names_to = "sample", values_to = "RPM")

TRIM62 %>% 
  ggplot(aes(x = sample, y = RPM, colour = TRIM62_value)) +
  geom_point(size = 2) +
  scale_colour_manual(values = c("forest green", "orange")) +
  theme_classic() + 
  theme(legend.position = "bottom") +
  labs(x = "Sample", y = "RPM", title = "Change in TRIM62 expression value before and after normalization", subtitle = "raw vs. RLE Normalized Counts")

We’ll now save our processed expression dataframe as a tab-delimited text file which we can load in for Part 2.

eNorm <- eNorm %>% 
  rownames_to_column(var = "gene")
write_delim(eNorm, file = normalizePath("data/eNorm.txt"), delim = "\t")

End of Part 1